1 Examination of a memory access classification scheme for pointer-intensiv

Luddy Harrison 

January 1996 Proceedings of the 10th international conference on Supercom 
Full text available: pdf(991.11 KB) Additional Information: full citation, references, citing



Keywords: CPU architecture, data cache, instruction profiling, memory access 
tolerance 



2 A general framework for prefetch scheduling in linked data structures and i 
prefetching 

Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, Donald Yeung
May 2004 ACM Transactions on Computer Systems (TOCS), Volume 22 Issu< 

Full text available: pdf(2.45 MB) Additional Information: full citation, abstract, reference

Pointer-chasing applications tend to traverse composite data structures consis 
chains. While the traversal of any single pointer chain leads to the serializatio 
independent pointer chains provides a source of memory parallelism. This arti 
interchain memory parallelism for the purpose of memory latency tolerance, u 
prefetching. Previous work ... 

Keywords: Data prefetching, memory parallelism, pointer-chasing code 

3 Speeding up irregular applications in shared-memory multiprocessors: mer
Zheng Zhang, Josep Torrellas 

May 1995 ACM SIGARCH Computer Architecture News , Proceedings of the 22nd < 
Computer architecture, Volume 23 Issue 2 

Full text available: pdf(1.74 MB) Additional Information: full citation, abstract, references, ci

While many parallel applications exhibit good spatial locality, other important 
problem-solving or CAD do not. Often, these irregular codes contain small reo 
Consequently, while the former applications benefit from long cache lines, the 
solution is to combine short lines with prefetching. In this way, each applicatii 
locality that it has. However, prefetching, if provided, ... 

4 Compiler scheduling: Compiler managed micro-cache bypassing for high p 
Youfeng Wu, Ryan Rakvic, Li-Ling Chen, Chyi-Chang Miao, George Chrysos, Jes< 
November 2002 Proceedings of the 35th annual ACM/IEEE international symposi 
Full text available: pdf(1.15 MB) Publisher Site Additional Information: full citation, abs

Advanced microprocessors have been increasing clock rates, well beyond the < 
performance microprocessors, a small and fast data micro cache (ucache) is ir 
proper management of it via load bypassing has a significant performance im| 
evaluate a hardware-software collaborative technique to manage ucache bypa 
supports the ucache bypassing with a flag in the load ... 



5 Instruction prefetching of systems codes with layout optimized for reduced < 

Chun Xia, Josep Torrellas 

May 1996 ACM SIGARCH Computer Architecture News , Proceedings of the 23rd a 

Computer architecture, Volume 24 Issue 2 
Full text available: pdf(1.85 MB) Additional Information: full citation, abstract, references, ci

High-performing on-chip instruction caches are crucial to keep fast processors 
caches are usually successful at intercepting instruction fetches in loop-intens 
able to do so in large systems codes. To improve the performance of the latte 
out the code in memory for reduced cache conflicts. Interestingly, such an op< 
can be exploited by a new type of ... 

6 Memory and network optimization in embedded designs: Multi-profile base< 
E. Wanderley Netto, R. Azevedo, P. Centoducatte, G. Araujo 

June 2004 Proceedings of the 41st annual conference on Design automation 
Full text available: pdf(272.41 KB) Additional Information: full citation, abstract, referem

Code compression has been shown to be an effective technique to reduce cod< 
systems. It has also been used as a way to increase cache hit ratio, thus redu 
performance. This paper proposes an approach to mix static/dynamic instructi 
so as to best exploit trade-offs in compression ratio/performance. Compressec 
indices into fixed-size codewords, el ... 

Keywords: code compression, code density, compression 



Results (page 1): compiler-directed inst...<and> code instrumentation

7 Reducing cache misses using hardware and software page placement 
Timothy Sherwood, Brad Calder, Joel Emer 

May 1999 Proceedings of the 13th international conference on Supercomputing 
Full text available: pdf(1.50 MB) Additional Information: full citation, references, citings, index terms



8 Cache-conscious data placement 

Brad Calder, Chandra Krintz, Simmi John, Todd Austin 

October 1998 Proceedings of the eighth international conference on Architectural

operating systems, Volume 33, 32 Issue 11, 5
Full text available: pdf(1.48 MB) Additional Information: full citation, abstract, references, ci

As the gap between memory and processor speeds continues to widen, cache 
component of processor performance. Compiler techniques have been used to 
by mapping code with temporal locality to different cache blocks in the virtual 
conflicts. These code placement techniques can be applied directly to the prob 
cache performance. In this paper we present a gene ...

9 Compiler and hardware support for cache coherence in large-scale multipn 
performance study 

Lynn Choi, Pen-Chung Yew 

May 1996 ACM SIGARCH Computer Architecture News , Proceedings of the 23rd a 

Computer architecture, Volume 24 Issue 2 
Full text available: pdf(1.48 MB) Additional Information: full citation, abstract, reference

In this paper, we study a hardware-supported, compiler directed (HSCD) cache
implemented on a large-scale multiprocessor using off-the-shelf microprocess 
adapted to various cache organizations, including multi-word cache lines and 
system related issues, including critical sections, inter-thread communication, 
addressed. The cost of the required hardware sup ... 

10 Physical Experimentation with Prefetching Helper Threads on Intel's Hyper- 
Dongkeun Kim, Steve Shih-wei Liao, Perry H. Wang, Juan del Cuvillo, Xinmin Ti 
Yeung, Milind Girkar, John P. Shen 

March 2004 Proceedings of the international symposium on Code generation and i 

runtime optimization 
Full text available: pdf(264.47 KB) Additional Information: full citation, abstr

Pre-execution techniques have received much attention as an effective way of
ever-increasing memory latency. A number of pre-execution techniques based
been proposed and studied extensively by researchers. They report promising
Simultaneous Multithreading (SMT) processor. In this paper, we apply the help
multithreaded machine, i.e., Intel Pentium 4 processor with Hyp ...

11 Compiler-directed run-time monitoring of program data access
Chen Ding, Yutao Zhong 

June 2002 ACM SIGPLAN Notices , Proceedings of the workshop on Memory systei 
supplement 

Full text available: pdf(1.40 MB) Additional Information: full citation, abstract, referer

Accurate run-time analysis has been expensive for complex programs, in part 
data. Some applications require only partial reorganization. An example of thi: 
from a mobile device. Complete monitoring is not necessary because not all a 
support partial monitoring, this paper presents a framework that includes a sc 
run-time monitor. The compiler inserts ru ... 

12 Trace-driven memory simulation: a survey 

Richard A. Uhlig, Trevor N. Mudge 

June 1997 ACM Computing Surveys (CSUR), Volume 29 Issue 2 

Full text available: pdf(836.11 KB) Additional Information: full citation, abstract, references, citin

As the gap between processor and memory speeds continues to widen, metho 
designs before they are implemented in hardware are becoming increasingly i 
trace-driven memory simulation, has been the subject of intense interest amc 
enjoyed rapid development and substantial improvements during the past dec 
these developments by establishing criteria for evaluating trac ... 

Keywords: TLBs, caches, memory management, memory simulation, trace-dri 



13 Using generational garbage collection to implement cache-conscious data | 

Trishul M. Chilimbi, James R. Larus 

October 1998 ACM SIGPLAN Notices , Proceedings of the first international sympo! 
34 Issue 3 

Full text available: pdf(1.20 MB) Additional Information: full citation, abstract, references, ci

The cost of accessing main memory is increasing. Machine designers have trie 
processor and memory technology trends underlying this increasing gap with 
tolerate memory latency. These techniques, unfortunately, are only occasiona 
programs. Recent research has demonstrated the value of a complementary a 
structures are reorganized to improve cache loca ... 

Keywords: cache-conscious data placement, garbage collection, object-oriente 



14 Cache coherence in large-scale shared-memory multiprocessors: issues ai 

David J. Lilja 

September 1993 ACM Computing Surveys (CSUR), Volume 25 Issue 3 

Full text available: pdf(3.12 MB) Additional Information: full citation, references, citings, index terms

15 Memory data organization for improved cache performance in embedded p
Preeti Ranjan Panda, Nikil D. Dutt, Alexandru Nicolau 

October 1997 ACM Transactions on Design Automation of Electronic Systems (TC 
Full text available: pdf(884.55 KB) Additional Information: full citation, abstract, references,

Code generation for embedded processors opens up the possibility for several 
that have been ignored by traditional compilers due to compilation time const 
into account the parameters of the data caches for organizing scalar and arrav 
into memory, with the objective of improving data cache performance. We pre 
to minimize compulsory cache misse ... 

Keywords: cache memory, data cache, memory synthesis, system design, sys 



16 Cache-conscious structure definition 
Trishul M. Chilimbi, Bob Davidson, James R. Larus 

May 1999 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1999 conferei

and implementation, Volume 34 Issue 5 
Full text available: pdf(1.30 MB) Additional Information: full citation, abstract, references, ci

A program's cache performance can be improved by changing the organizatior 
pointer-based data structures. Previous techniques improved the cache perfor 
distinct instances to increase reference locality. These techniques produced si* 
but worked best for small structures that could be packed into a cache block.7 
concentrating on the internal organization of f ... 

Keywords: cache-conscious definition, class splitting, field reorganization, stn 



17 Compiler transformations for high-performance computing 
David F. Bacon, Susan L. Graham, Oliver J. Sharp 

December 1994 ACM Computing Surveys (CSUR), Volume 26 Issue 4 

Full text available: pdf(6.32 MB) Additional Information: full citation, abstract, references, citings

In the last three decades a large number of compiler transformations for optir 
implemented. Most optimizations for uniprocessors reduce the number of insti 
transformations based on the analysis of scalar quantities and data-flow techr 
high-performance superscalar, vector, and parallel processors maximize parall 
transformations that rely on tracking the properties o ... 

Keywords: compilation, dependence analysis, locality, multiprocessors, optimi 
processors, vectorization 

18 Predictability of load/store instruction latencies

Santosh G. Abraham, Rabin A. Sugumar, Daniel Windheiser, B. R. Rau, Rajiv Gu 
December 1993 Proceedings of the 26th annual international symposium on Micro; 

Full text available: pdf(1.51 MB) Additional Information: full citation, references, citings



19 Memory system performance of programs with intensive heap allocation 

Amer Diwan, David Tarditi, Eliot Moss 

August 1995 ACM Transactions on Computer Systems (TOCS), Volume 13 I< 
Full text available: pdf(2.10 MB) Additional Information: full citation, abstract, references, ci

Heap allocation with copying garbage collection is a general storage managenr 
languages. It is believed to have poor memory system performance. To invest 
study of the memory system performance of heap allocation for memory systc 
studied the performance of mostly functional Standard ML programs which mc 
found that most machines support heap allocation poorly. Howeve ... 

Keywords: automatic storage reclamation, copying garbage collection, garbag- 
collection, heap allocation, page mode, subblock placement, write through, wr 
write-policy 



20 The Wisconsin Wind Tunnel: virtual prototyping of parallel computers 

Steven K. Reinhardt, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Le 
June 1993 ACM SIGMETRICS Performance Evaluation Review , Proceedings of the 
Measurement and modeling of computer systems, Volume 21 Issue 1 

Full text available: pdf(1.40 MB) Additional Information: full citation, references, citings,